Libraries I may use called
library(tidyverse)
# install for visualizations
library(ggplot2)
# install to combine date and time
library(lubridate)
# for melting a df
library(reshape)
library(data.table)
Perceived Health Status
Reading in the first dataset, perceived health status.
perceived_health_status <- read_csv("../data/perceived_health_status.csv")
# dnmk <- read_csv("../data/denmark_only_phs.csv")
# perceived_health_status_once <- read_csv("../data/perceived_health_status_once.csv")
Inspecting the denmark only data:
# dnmk %>%
# filter(Sex == "Total")
It appears that then
Inspecting the perceived health data:
perceived_health_status %>%
filter(`Reference area` == "Denmark")
Inspect original perceived health data:
# perceived_health_status_once%>%
# filter(`Reference area` == "Denmark")
perceived_health_status
Filtering.
perceived_health_status_stripped <- perceived_health_status %>%
filter(TIME_PERIOD == 2022) %>%
filter(REF_AREA == "AUT") %>%
filter(Sex == "Total") %>%
filter(Age == "15 years or over")
perceived_health_status_stripped
NA
As it appears that after 2007, the number of observations are more
significant in number, I will limit my data to 2007 and later. But,
since it appears the number of observations drops off in 2024, I will
limit my data to a range of 2007-2023. As well, I want to capture all
genders and ages. (Corrected “==” to be “%in%” and that corrected much
the problem I was running into.)
# input code to limit year range, age range, and gender in perceived health status data
phs <- perceived_health_status %>%
filter(TIME_PERIOD %in% c(2007:2023)) %>%
filter(Sex == "Total") %>%
filter(Age == "15 years or over")
phs
NA
It turns out that Denmark’s data comes only from a population of
15-24 year old, which would be excluded when I filter for age range
collective/Total. (Corrected “==” to be “%in%” which corrected what I
was seeing.)
# perceived_health_status_once %>%
# filter(TIME_PERIOD %in% c(2007:2023)) %>%
# filter(Sex == "Total") %>%
# filter(AGE == "Y_GE15")
# sort(unique(perceived_health_status_once$Age))
#
# perceived_health_status_once %>%
# group_by(AGE) %>%
# summarize(n=n())
Inspect the data for the number of years it covers.
barplot(table(perceived_health_status$TIME_PERIOD), main = "number of observations of year in the data")

NA
NA
Inspect the data for the number of age categories it covers.
barplot(table(perceived_health_status$Age), main = "number of observations of age groups in the data")

Inspecting the data for balance in the health status column.
phs %>%
group_by(HEALTH_STATUS) %>%
summarize(n=n())
NA
Which countries are most heavily represented in the data? Denmark was
selected to be included I will download a Denmark only data and
investigate why it is no longer located in this data.
phs %>%
group_by(`Reference area`) %>%
summarize(n=n())
NA
phs %>%
group_by(AGE) %>%
summarize(n=n())
NA
The final state of my Perceived Health Status data will be:
phs
NA
Education Levels
Reading in the second dataset, education levels.
# education_level <- read_csv("../data/educational_attainment_distribution_age_gender.csv")
# education_levels_defined <- read_csv("../data/educational_attainment_distribution.csv")
education_levels_three <- read_csv("../data/educational_attainment.csv")
Early data exploration in the education data
el_third <- education_levels_three %>%
filter(Sex == "Total") %>%
filter(Age == "From 25 to 64 years") %>%
filter(TIME_PERIOD == 2010) %>%
filter(OBS_STATUS == "A") %>%
filter(REF_AREA == "AUT") %>%
filter(STATISTICAL_OPERATION == "OBS")
el_third
# ISCED11A_5T8 = Tertiary education
# ISCED11A_3_4 = Upper secondary or post-secondary non-tertiary education
# ISCED11A_0T2 = Below upper secondary education
education_levels_three
NA
Verify that there are simply three categories for the education level
attained and Education attainment level columns
sort(unique(education_levels_three$ATTAINMENT_LEV))
[1] "ISCED11A_0T2" "ISCED11A_3_4" "ISCED11A_5T8"
sort(unique(education_levels_three$`Educational attainment level`))
[1] "Below upper secondary education"
[2] "Tertiary education"
[3] "Upper secondary or post-secondary non-tertiary education"
sort(unique(education_levels_three$STATISTICAL_OPERATION))
[1] "OBS" "SE"
I want to use the observed values and not the standard error values
at this time.
education_levels_three %>%
filter(STATISTICAL_OPERATION == "OBS")
NA
I want to see the representation of countries in the data. Certain
countries are significantly under represented.
barplot(table(education_levels_three$`Reference area`), main = "number of observations of countries in the data")

barplot(table(education_levels_three$TIME_PERIOD), main = "number of observations of year in the data")

check to see which years are present in the data. Base on the lack of
data in the years prior to about 1998, I will limit the dataset to
1998-2024.
sort(unique(education_levels_three$TIME_PERIOD))
[1] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
[15] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
[29] 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
[43] 2023 2024
education_levels_three_time <- education_levels_three %>%
filter(TIME_PERIOD %in% c(1998:2024))
I would like to discover which of the countries are under
represented.
education_levels_three_time %>%
group_by(`Reference area`) %>%
summarize(n=n())
I want to drop any countries whose count is less than 1800.
# count the frequency of each country
country_count <- table(education_levels_three_time$`Reference area`)
# I do not want any fewer rows for a country than this
num_in_data <- 1800
# which countries match or exceed that number
countries_to_keep <- names(country_count[country_count > num_in_data])
# filter the dataset to keep only rows of countries who are
#represented in the data a certain number of times or more
education_levels_three_time_countries <- education_levels_three_time[education_levels_three_time$`Reference area` %in% countries_to_keep, ]
education_levels_three_time_countries
NA
NA
Check out which ages are available to filter by.
sort(unique(education_levels_three_time_countries$Age))
[1] "From 25 to 34 years" "From 25 to 64 years" "From 35 to 44 years"
[4] "From 45 to 54 years" "From 55 to 64 years"
Check number of times each age range is represented in the data
barplot(table(education_levels_three_time_countries$Age), main = "number of observations of age in the data")

NA
NA
Filter using all of the methods left that I want to filter by: all
genders, ages, as well as only observed data, not the standard
error.
el <- education_levels_three_time_countries %>%
filter(Sex == "Total") %>%
filter(Age == "From 25 to 64 years") %>%
filter(OBS_STATUS == "A") %>%
filter(STATISTICAL_OPERATION == "OBS")
el
The final state of my education level data will be:
# selected all adults from age, and all genders, and only observed values no standard error values
el[1:2,]
NA
Social and Community connection pulled from the Wellbeing
dataset
Reading in the third dataset, social connectedness from the wellbeing
dataset.
social_wellbeing <- read_csv("../data/social_connection_wellbeing.csv")
View the data.
social_wellbeing[1:2, ]
Inspect the UNIT_MEASURE column’s unique values.
sort(unique(social_wellbeing$UNIT_MEASURE))
[1] "0_TO_10" "H_WK" "PT_POP_Y_GE15"
[4] "PT_POP_Y_GE15_SUB" "PT_POP_Y_GE16" "PT_POP_Y_GE16_SUB"
sort(unique(social_wellbeing$Measure))
[1] "Feeling lonely"
[2] "Lack of social support"
[3] "Life satisfaction"
[4] "Life satisfaction score less than 5"
[5] "Satisfaction with personal relationships"
[6] "Satisfaction with personal relationships score less than 5"
[7] "Social support"
[8] "Time spent in social interactions"
social_wellbeing %>%
filter(TIME_PERIOD == 2022) %>%
filter(REF_AREA == "AUS") %>%
filter(Measure %in% c("Social support", "Lack of social support")) %>%
filter(Sex == "Total") %>%
filter(Age == "Total") %>%
filter(`Education level` == "Total")
NA
Assigning filtered dataset to new variable.
social_support_or_lack <- social_wellbeing %>%
filter(TIME_PERIOD %in% c(2007:2023)) %>%
filter(Measure %in% c("Social support", "Lack of social support")) %>%
filter(Sex == "Total") %>%
filter(Age == "Total") %>%
filter(`Education level` == "Total")
social_support_or_lack[1:2, ]
NA
barplot(table(social_support_or_lack$TIME_PERIOD), main = "frequency of years in the data")

# find unique values for Measure column
sort(unique(social_wellbeing$Measure))
[1] "Feeling lonely"
[2] "Lack of social support"
[3] "Life satisfaction"
[4] "Life satisfaction score less than 5"
[5] "Satisfaction with personal relationships"
[6] "Satisfaction with personal relationships score less than 5"
[7] "Social support"
[8] "Time spent in social interactions"
How is satisfaction with personal relationships scored?
social_wellbeing %>%
filter(Measure == "Satisfaction with personal relationships score less than 5")
Inspect which Measures are listed with the 0 to 10 unit measure.
social_wellbeing_0_10 <- social_wellbeing %>%
filter(UNIT_MEASURE == "0_TO_10")
social_wellbeing_0_10[1:2, ]
NA
sort(unique(social_wellbeing_0_10$Measure))
[1] "Life satisfaction"
[2] "Satisfaction with personal relationships"
Limit the data to totals on sex, age, and education, in order to
capture a population not a specific group within a population. As well,
narrow down the date range and select only the Measures whose unit
measure is 0-10.
satisfaction_life_rels <- social_wellbeing %>%
filter(TIME_PERIOD %in% c(2007:2023)) %>%
filter(Measure %in% c("Life satisfaction", "Satisfaction with personal relationships")) %>%
filter(Sex == "Total") %>%
filter(Age == "Total") %>%
filter(`Education level` == "Total")
satisfaction_life_rels[1:2, ]
NA
barplot(table(satisfaction_life_rels$TIME_PERIOD))

Looking into the feeling lonely measure.
lonely <- social_wellbeing %>%
filter(TIME_PERIOD %in% c(2007:2023)) %>%
filter(Measure == "Feeling lonely") %>%
filter(Sex == "Total") %>%
filter(Age == "Total") %>%
filter(`Education level` == "Total")
lonely[1:2, ]
NA
barplot(table(lonely$TIME_PERIOD), main = "how often certain years are rep'd in data")

num_countries_repd_in_lonely <- lonely %>%
group_by(`Reference area`) %>%
summarise(n=n())
ctrs_desc <- order(num_countries_repd_in_lonely$n, decreasing = TRUE)
num_countries_repd_in_lonely[ctrs_desc,][1:2, ]
NA
NA
barplot(table(social_support_or_lack$TIME_PERIOD), main = "how often each year shows up in dataset" )

social_interactions <- social_wellbeing %>%
filter(TIME_PERIOD %in% c(2007:2023)) %>%
filter(Measure == "Time spent in social interactions") %>%
filter(Sex == "Total") %>%
filter(Age == "Total") %>%
filter(`Education level` == "Total")
social_interactions[1:2, ]
NA
barplot(table(social_interactions$TIME_PERIOD))

social_interactions %>%
group_by(`Reference area`) %>%
summarise(n=n())
NA
The final datasets from wellbeing data I have chosen are as
follows:
# measured as a percentage fo the adult population and is well represented across all years in selected data
social_support_or_lack[1:2, ]
# measured as a percentage of the adult population, and is almost exclusively 2018 and 2022
lonely[1:2, ]
# years most represented in the data: 2013, 2018, 2022; it is measured on a 0-10 scale.
satisfaction_life_rels[1:2, ]
# measured in units of hours per week, across various years, very difficult to compare, I will exclude this data for now
social_interactions[1:2, ]
NA
Safety
Reading in the second dataset, education levels.
safety_regions <- read_csv("../data/safety_regions.csv")
safety <- safety_regions %>%
filter(TIME_PERIOD %in% c(2007:2023)) %>%
filter(TERRITORIAL_LEVEL == "CTRY") %>%
filter(Sex %in% c("Total", "Not applicable")) %>%
filter(`Observation status` == "Normal value")
safety[1:2, ]
NA
barplot(table(safety$OBS_VALUE), main = "frequency of percentage of pop measured unsafe instances in the data")

sort(unique(safety$`Observation status`))
[1] "Normal value"
The final dataset from safety data I have chosen are as follows:
safety[1:2,]
Final datasets are:
# perceived health status, subjective survey of good, fair, and bad health
phs[1:2, ]
# selected all adults from age, and all genders, and only observed values no standard error values
el[1:2, ]
# measured as a percentage fo the adult population and is well represented across all years in selected data
social_support_or_lack[1:2, ]
# measured as a percentage of the adult population, and is almost exclusively 2018 and 2022
lonely[1:2, ]
# years most represented in the data: 2013, 2018, 2022; it is measured on a 0-10 scale.
satisfaction_life_rels[1:2, ]
# measured in cases per 100,000 persons, well distributed across selected years
safety[1:2,]
NA
###Merging data sets together into one
Order the data’s country column (REF_AREA) alphabetically.
# Order the data's country column (`Reference area`) alphabetically.
phs_v1 <- phs[order(phs$`Reference area`),]
el_v1 <- el[order(el$`Reference area`),]
safety_v1 <- safety[order(safety$`Reference area`),]
social_support_or_lack_v1 <- social_support_or_lack[order(social_support_or_lack$`Reference area`),]
lonely_v1 <- lonely[order(lonely$`Reference area`),]
satisfaction_life_rels_v1 <- satisfaction_life_rels[order(satisfaction_life_rels$`Reference area`),]
phs_v1 # percentage of the population
el_v1 # percentage of the population
safety_v1 # measured in cases per 100,000 persons
social_support_or_lack_v1 # percentage of the population
lonely_v1 # percentage of the population
satisfaction_life_rels_v1[1:2,]
NA
rename columns
phs_v2 <- rename(phs_v1, replace = c(`Reference area` = "Reference_area", `Frequency of observation` = "Freq_of_obs_phs", `Unit of measure` = "Unit_of_measure_phs", `Health status` = "Health_status", "OBS_VALUE" = "OBS_VALUE_PHS"))
el_v2 <- rename(el_v1, replace = c(`Reference area` = "Reference_area", `Frequency of observation` = "Freq_of_obs_el", `Unit of measure` = "Unit_of_measure_el", `Educational attainment level` = "Edu_attainment_lvl", "OBS_VALUE" = "OBS_VALUE_EL"))
safety_v2 <- rename(safety_v1, replace = c(`Reference area` = "Reference_area", `Frequency of observation` = "Freq_of_obs_sfty", `Unit of measure` = "Unit_of_measure_sfty", "Measure" = "Measure_safety", "OBS_VALUE" = "OBS_VALUE_SAFETY", "MEASURE" = "MEASURE_SAFETY"))
social_support_or_lack_v2 <- rename(social_support_or_lack_v1, replace = c(`Reference area` = "Reference_area", `Unit of measure` = "Unit_of_measure_social", "Measure" = "Measure_social", "OBS_VALUE" = "OBS_VALUE_SOCIAL", "MEASURE" = "MEASURE_SOCIAL"))
lonely_v2 <- rename(lonely_v1, replace = c(`Reference area` = "Reference_area", "OBS_VALUE" = "OBS_VALUE_LONELY", "Measure" = "Measure_lonely", "MEASURE" = "MEASURE_LONELY"))
satisfaction_life_rels_v2 <- rename(satisfaction_life_rels_v1, replace = c(`Reference area` = "Reference_area", "OBS_VALUE" = "OBS_VALUE_SAT", "Measure" = "Measure_sat", "MEASURE" = "MEASURE_SAT"))
satisfaction_life_rels_v2[1:2,]
drop columns not anticipated to be needed or subset only the columns
I will require
# Columns to keep
phs_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_PHS", "Health_status", "HEALTH_STATUS")
el_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_EL", "Edu_attainment_lvl", "ATTAINMENT_LEV")
safety_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_SAFETY", "Measure_safety", "MEASURE_SAFETY")
social_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_SOCIAL", "Measure_social", "MEASURE_SOCIAL")
lonely_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_LONELY", "Measure_lonely", "MEASURE_LONELY")
satisfaction_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_SAT", "Measure_sat", "MEASURE_SAT")
phs_v3 <- subset(phs_v2, select = phs_columns_to_keep)
el_v3 <- subset(el_v2, select = el_columns_to_keep)
safety_v3 <- subset(safety_v2, select = safety_columns_to_keep)
social_support_or_lack_v3 <- subset(social_support_or_lack_v2, select = social_columns_to_keep)
lonely_v3 <- subset(lonely_v2, select = lonely_columns_to_keep)
satisfaction_life_rels_v3 <- subset(satisfaction_life_rels_v2, select = satisfaction_columns_to_keep)
phs_v3
el_v3
safety_v3
social_support_or_lack_v3
lonely_v3
satisfaction_life_rels_v3
NA
Pivot all data sets prior to merging them.
phs_pivoted <- phs_v3 %>%
pivot_wider(
id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
names_from = c("HEALTH_STATUS"),
names_prefix = "health_status_",
values_from = "OBS_VALUE_PHS"
)
phs_pivoted
NA
el_pivoted <- el_v3 %>%
pivot_wider(
id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
names_from = c("Edu_attainment_lvl"),
values_from = "OBS_VALUE_EL"
)
el_pivoted
NA
safety_pivoted <- safety_v3 %>%
pivot_wider(
id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
names_from = c("Measure_safety"),
values_from = "OBS_VALUE_SAFETY"
)
safety_pivoted
NA
social_support_or_lack_pivoted <- social_support_or_lack_v3 %>%
pivot_wider(
id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
names_from = c("Measure_social"),
values_from = "OBS_VALUE_SOCIAL"
)
social_support_or_lack_pivoted
NA
lonely_pivoted <- lonely_v3 %>%
pivot_wider(
id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
names_from = c("Measure_lonely"),
values_from = "OBS_VALUE_LONELY"
)
lonely_pivoted
NA
satisfaction_life_rels_pivoted <- satisfaction_life_rels_v3 %>%
pivot_wider(
id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
names_from = c("Measure_sat"),
values_from = "OBS_VALUE_SAT"
)
satisfaction_life_rels_pivoted
NA
Merge all 6 data sets
phs_el <- merge(
phs_pivoted,
el_pivoted,
by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
all = TRUE
)
phs_el
NA
phs_el_safety <- merge(
phs_el,
safety_pivoted,
by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
all = TRUE
)
phs_el_safety
NA
phs_el_safety_social <- merge(
phs_el_safety,
social_support_or_lack_pivoted,
by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
all = TRUE
)
phs_el_safety_social
NA
phs_el_safety_social_lonely <- merge(
phs_el_safety_social,
lonely_pivoted,
by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
all = TRUE
)
phs_el_safety_social_lonely
NA
phs_el_safety_social_lonely_satisfaction <- merge(
phs_el_safety_social_lonely,
satisfaction_life_rels_pivoted,
by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
all = TRUE
)
phs_el_safety_social_lonely_satisfaction
NA
Use fill from tidyr to fill, up and down, the lonely data
phs_el_safety_social_lonely_satisfaction_fill <- phs_el_safety_social_lonely_satisfaction |>
group_by(Reference_area) |>
fill(`Feeling lonely`, .direction = "updown") |>
ungroup()
phs_el_safety_social_lonely_satisfaction_fill
NA
phs_el_safety_social_lonely_satisfaction_filled <- phs_el_safety_social_lonely_satisfaction_fill |>
group_by(Reference_area) |>
fill(`Satisfaction with personal relationships`, .direction = "updown") |>
ungroup()
phs_el_safety_social_lonely_satisfaction_filled
NA
df <- phs_el_safety_social_lonely_satisfaction_filled |>
group_by(Reference_area) |>
fill(`Life satisfaction`, .direction = "updown") |>
ungroup()
df
NA
Final version of the dataset to be used in the shiny app.
df %>% view()
# write.csv(df, file = "../data/df.csv")
hist(df$`Feeling lonely`, xlab="lonely")

sort(unique(df$`Feeling lonely`))
[1] 3.131366 3.444549 3.470850 3.500000 3.589915 3.600000 3.817022
[8] 3.857592 3.909297 4.000000 4.044682 4.068202 4.179466 4.186444
[15] 4.254773 4.286276 4.444873 4.500000 4.525158 4.800000 4.850588
[22] 4.900000 4.947824 5.105207 5.129866 5.183457 5.233685 5.259631
[29] 5.313834 5.327242 5.380740 5.401067 5.445733 5.549899 5.573010
[36] 5.658739 5.658772 5.664709 5.927277 6.095523 6.244574 6.295952
[43] 6.300000 6.400000 6.440687 6.466081 6.616622 6.659974 6.953202
[50] 7.027851 7.088179 7.137664 7.167792 7.193489 7.280727 7.592857
[57] 7.850135 8.061958 8.135317 8.136483 8.329389 8.579049 8.850272
[64] 8.921706 9.214390 9.330003 9.332626 9.348542 10.773153 12.860000
[71] 13.420000 13.850000
class(df)
[1] "tbl_df" "tbl" "data.frame"
max(df$`Feeling lonely`, na.rm = TRUE)
[1] 13.85
df %>% view()
---
title: "R Notebook"
output: html_notebook
---
Libraries I may use called
```{r}

library(tidyverse)
# install for visualizations 
library(ggplot2)
# install to combine date and time
library(lubridate)
# for melting a df
library(reshape)
library(data.table)

```
### Perceived Health Status

Reading in the first dataset, perceived health status.
 
```{r}
perceived_health_status <- read_csv("../data/perceived_health_status.csv")
# dnmk <- read_csv("../data/denmark_only_phs.csv")
# perceived_health_status_once <- read_csv("../data/perceived_health_status_once.csv")

```
Inspecting the denmark only data:

```{r}
# dnmk %>% 
#   filter(Sex == "Total")
```

It appears that then 


Inspecting the perceived health data:

```{r}
perceived_health_status %>% 
  filter(`Reference area` == "Denmark")
```

Inspect original perceived health data:
```{r}
# perceived_health_status_once%>% 
#   filter(`Reference area` == "Denmark")
```
```{r}
perceived_health_status
```


Filtering.

```{r}

perceived_health_status_stripped <- perceived_health_status %>% 
  filter(TIME_PERIOD == 2022) %>% 
  filter(REF_AREA == "AUT") %>% 
  filter(Sex == "Total") %>% 
  filter(Age == "15 years or over")  

perceived_health_status_stripped

```
As it appears that after 2007, the number of observations are more significant in number, I will limit my data to 2007 and later. But, since it appears the number of observations drops off in 2024, I will limit my data to a range of 2007-2023. As well, I want to capture all genders and ages. (Corrected "==" to be "%in%" and that corrected much the problem I was running into.)

```{r}

# input code to limit year range, age range, and gender in perceived health status data

phs <- perceived_health_status %>% 
  filter(TIME_PERIOD %in% c(2007:2023)) %>% 
  filter(Sex == "Total") %>% 
  filter(Age == "15 years or over")

phs

```
It turns out that Denmark's data comes only from a population of 15-24 year old, which would be excluded when I filter for age range collective/Total. (Corrected "==" to be "%in%" which corrected what I was seeing.)

```{r}

# perceived_health_status_once %>% 
#   filter(TIME_PERIOD %in% c(2007:2023)) %>% 
#   filter(Sex == "Total") %>% 
#   filter(AGE == "Y_GE15") 

```

```{r}

# sort(unique(perceived_health_status_once$Age))
# 
# perceived_health_status_once %>% 
#   group_by(AGE) %>%
#   summarize(n=n())

```


Inspect the data for the number of years it covers.

```{r}

barplot(table(perceived_health_status$TIME_PERIOD), main = "number of observations of year in the data")


```

Inspect the data for the number of age categories it covers.
```{r}
barplot(table(perceived_health_status$Age), main = "number of observations of age groups in the data")
```

Inspecting the data for balance in the health status column.

```{r}

phs %>% 
  group_by(HEALTH_STATUS) %>% 
  summarize(n=n())

```

Which countries are most heavily represented in the data? Denmark was selected to be included I will download a Denmark only data and investigate why it is no longer located in this data. 

```{r}

phs %>% 
  group_by(`Reference area`) %>%
  summarize(n=n())

```

```{r}
phs %>% 
  group_by(AGE) %>%
  summarize(n=n())

```

The final state of my Perceived Health Status data will be:
```{r}

phs

```



### Education Levels



Reading in the second dataset, education levels.

```{r}

# education_level <- read_csv("../data/educational_attainment_distribution_age_gender.csv")
# education_levels_defined <- read_csv("../data/educational_attainment_distribution.csv")
education_levels_three <- read_csv("../data/educational_attainment.csv")

```

Early data exploration in the education data

```{r}
el_third <- education_levels_three %>% 
  filter(Sex == "Total") %>% 
  filter(Age == "From 25 to 64 years") %>% 
  filter(TIME_PERIOD == 2010) %>% 
  filter(OBS_STATUS == "A") %>% 
  filter(REF_AREA == "AUT") %>% 
  filter(STATISTICAL_OPERATION == "OBS")

el_third
```


```{r}

# ISCED11A_5T8 = Tertiary education
# ISCED11A_3_4 = Upper secondary or post-secondary non-tertiary education
# ISCED11A_0T2 = Below upper secondary education
education_levels_three

```
Verify that there are simply three categories for the education level attained and  Education attainment level columns
```{r}

sort(unique(education_levels_three$ATTAINMENT_LEV))
sort(unique(education_levels_three$`Educational attainment level`))

```

```{r}
sort(unique(education_levels_three$STATISTICAL_OPERATION))
```
I want to use the observed values and not the standard error values at this time. 
```{r}

education_levels_three %>% 
  filter(STATISTICAL_OPERATION == "OBS")

```

I want to see the representation of countries in the data. Certain countries are significantly under represented.

```{r}
barplot(table(education_levels_three$`Reference area`), main = "number of observations of countries in the data")

```

```{r}
barplot(table(education_levels_three$TIME_PERIOD), main = "number of observations of year in the data")

```

check to see which years are present in the data. Base on the lack of data in the years prior to about 1998, I will limit the dataset to 1998-2024.

```{r}

sort(unique(education_levels_three$TIME_PERIOD))

education_levels_three_time <- education_levels_three %>% 
  filter(TIME_PERIOD %in% c(1998:2024))

```

I would like to discover which of the countries are under represented. 

```{r}
education_levels_three_time %>% 
  group_by(`Reference area`) %>%
  summarize(n=n())
```


I want to drop any countries whose count is less than 1800. 

```{r}

# count the frequency of each country
country_count <- table(education_levels_three_time$`Reference area`)

# I do not want any fewer rows for a country than this
num_in_data <- 1800

# which countries match or exceed that number
countries_to_keep <- names(country_count[country_count > num_in_data])

# filter the dataset to keep only rows of countries who are 
#represented in the data a certain number of times or more
education_levels_three_time_countries <- education_levels_three_time[education_levels_three_time$`Reference area` %in% countries_to_keep, ]

education_levels_three_time_countries 


```

Check out which ages are available to filter by.

```{r}
sort(unique(education_levels_three_time_countries$Age))
```

Check number of times each age range is represented in the data
```{r}

barplot(table(education_levels_three_time_countries$Age), main = "number of observations of age in the data")


```


Filter using all of the methods left that I want to filter by: all genders, ages, as well as only observed data, not the standard error.

```{r}

el <- education_levels_three_time_countries %>% 
  filter(Sex == "Total") %>% 
  filter(Age == "From 25 to 64 years") %>% 
  filter(OBS_STATUS == "A") %>%
  filter(STATISTICAL_OPERATION == "OBS")

```

```{r}
el
```


The final state of my education level data will be:

```{r}

# selected all adults from age, and all genders, and only observed values no standard error values
el[1:2,]

```






### Social and Community connection pulled from the Wellbeing dataset





Reading in the third dataset, social connectedness from the wellbeing dataset.


```{r}

social_wellbeing <- read_csv("../data/social_connection_wellbeing.csv")

```

View the data.
```{r}
social_wellbeing[1:2, ]
```

Inspect the UNIT_MEASURE column's unique values.
```{r}

sort(unique(social_wellbeing$UNIT_MEASURE))

```
```{r}

sort(unique(social_wellbeing$Measure))

```


```{r}

social_wellbeing %>% 
  filter(TIME_PERIOD == 2022) %>% 
  filter(REF_AREA == "AUS") %>% 
  filter(Measure %in% c("Social support", "Lack of social support")) %>% 
  filter(Sex == "Total") %>% 
  filter(Age == "Total") %>% 
  filter(`Education level` == "Total")

```
Assigning filtered dataset to new variable.

```{r}

social_support_or_lack <- social_wellbeing %>% 
  filter(TIME_PERIOD %in% c(2007:2023)) %>% 
  filter(Measure %in% c("Social support", "Lack of social support")) %>% 
  filter(Sex == "Total") %>% 
  filter(Age == "Total") %>% 
  filter(`Education level` == "Total")

social_support_or_lack[1:2, ]

```


```{r}

barplot(table(social_support_or_lack$TIME_PERIOD), main = "frequency of years in the data")

```


```{r}

# find unique values for Measure column
sort(unique(social_wellbeing$Measure))


```
How is satisfaction with personal relationships scored?

```{r}
social_wellbeing %>% 
  filter(Measure == "Satisfaction with personal relationships score less than 5")
```

Inspect which Measures are listed with the 0 to 10 unit measure.

```{r}

social_wellbeing_0_10 <- social_wellbeing %>% 
  filter(UNIT_MEASURE == "0_TO_10")

social_wellbeing_0_10[1:2, ]

```


```{r}

sort(unique(social_wellbeing_0_10$Measure))

```

Limit the data to totals on sex, age, and education, in order to capture a population not a specific group within a population. As well, narrow down the date range and select only the Measures whose unit measure is 0-10. 

```{r}

satisfaction_life_rels <- social_wellbeing %>% 
  filter(TIME_PERIOD %in% c(2007:2023)) %>% 
  filter(Measure %in% c("Life satisfaction", "Satisfaction with personal relationships")) %>% 
  filter(Sex == "Total") %>% 
  filter(Age == "Total") %>% 
  filter(`Education level` == "Total")
  
satisfaction_life_rels[1:2, ]

```
```{r}

barplot(table(satisfaction_life_rels$TIME_PERIOD))

```

Looking into the feeling lonely measure.

```{r}


lonely <- social_wellbeing %>% 
  filter(TIME_PERIOD %in% c(2007:2023)) %>%   
  filter(Measure == "Feeling lonely") %>% 
  filter(Sex == "Total") %>% 
  filter(Age == "Total") %>% 
  filter(`Education level` == "Total")
  
lonely[1:2, ]

```

```{r}
barplot(table(lonely$TIME_PERIOD), main = "how often certain years are rep'd in data")
```

```{r}

num_countries_repd_in_lonely <- lonely %>% 
  group_by(`Reference area`) %>% 
  summarise(n=n())


ctrs_desc <- order(num_countries_repd_in_lonely$n, decreasing = TRUE)

num_countries_repd_in_lonely[ctrs_desc,][1:2, ]


```
```{r}
barplot(table(social_support_or_lack$TIME_PERIOD), main = "how often each year shows up in dataset" )
```
```{r}

social_interactions <- social_wellbeing %>% 
  filter(TIME_PERIOD %in% c(2007:2023)) %>%   
  filter(Measure == "Time spent in social interactions") %>% 
  filter(Sex == "Total") %>% 
  filter(Age == "Total") %>% 
  filter(`Education level` == "Total")

social_interactions[1:2, ]

```
```{r}
barplot(table(social_interactions$TIME_PERIOD))
```
```{r}

social_interactions %>% 
  group_by(`Reference area`) %>% 
  summarise(n=n())

```



The final datasets from wellbeing data I have chosen are as follows:

```{r}

# measured as a percentage fo the adult population and is well represented across all years in selected data 
social_support_or_lack[1:2, ]

# measured as a percentage of the adult population, and is almost exclusively 2018 and 2022
lonely[1:2, ]

# years most represented in the data: 2013, 2018, 2022; it is measured on a 0-10 scale. 
satisfaction_life_rels[1:2, ]

# measured in units of hours per week, across various years, very difficult to compare, I will exclude this data for now
social_interactions[1:2, ]

```





### Safety


Reading in the second dataset, education levels.





```{r}

safety_regions <- read_csv("../data/safety_regions.csv")


```
```{r}
safety <- safety_regions %>% 
  filter(TIME_PERIOD %in% c(2007:2023)) %>% 
  filter(TERRITORIAL_LEVEL == "CTRY") %>% 
  filter(Sex %in% c("Total", "Not applicable")) %>% 
  filter(`Observation status` == "Normal value")

safety[1:2, ]

```
```{r}

barplot(table(safety$OBS_VALUE), main = "frequency of percentage of pop measured unsafe instances in the data")

```


```{r}
sort(unique(safety$`Observation status`))

```

The final dataset from safety data I have chosen are as follows:
```{r}
safety[1:2,]
```

Final datasets are:

```{r}

# perceived health status, subjective survey of good, fair, and bad health
phs[1:2, ]

# selected all adults from age, and all genders, and only observed values no standard error values
el[1:2, ]

# measured as a percentage fo the adult population and is well represented across all years in selected data 
social_support_or_lack[1:2, ]

# measured as a percentage of the adult population, and is almost exclusively 2018 and 2022
lonely[1:2, ]

# years most represented in the data: 2013, 2018, 2022; it is measured on a 0-10 scale. 
satisfaction_life_rels[1:2, ]

# measured in cases per 100,000 persons, well distributed across selected years
safety[1:2,]

```

###Merging data sets together into one

Order the data's country column (REF_AREA) alphabetically. 

```{r}

# Order the data's country column (`Reference area`) alphabetically.

phs_v1 <- phs[order(phs$`Reference area`),] 

el_v1 <- el[order(el$`Reference area`),]

safety_v1 <- safety[order(safety$`Reference area`),]

social_support_or_lack_v1 <- social_support_or_lack[order(social_support_or_lack$`Reference area`),]

lonely_v1 <- lonely[order(lonely$`Reference area`),]

satisfaction_life_rels_v1 <- satisfaction_life_rels[order(satisfaction_life_rels$`Reference area`),]
 


phs_v1 # percentage of the population
el_v1 # percentage of the population
safety_v1 # measured in cases per 100,000 persons
social_support_or_lack_v1 # percentage of the population
lonely_v1 # percentage of the population
satisfaction_life_rels_v1[1:2,]

```

rename columns

```{r}

phs_v2 <- rename(phs_v1, replace = c(`Reference area` = "Reference_area", `Frequency of observation` = "Freq_of_obs_phs", `Unit of measure` = "Unit_of_measure_phs", `Health status` = "Health_status", "OBS_VALUE" = "OBS_VALUE_PHS"))

el_v2 <- rename(el_v1, replace = c(`Reference area` = "Reference_area", `Frequency of observation` = "Freq_of_obs_el", `Unit of measure` = "Unit_of_measure_el", `Educational attainment level` = "Edu_attainment_lvl", "OBS_VALUE" = "OBS_VALUE_EL"))

safety_v2 <- rename(safety_v1, replace = c(`Reference area` = "Reference_area", `Frequency of observation` = "Freq_of_obs_sfty", `Unit of measure` = "Unit_of_measure_sfty", "Measure" = "Measure_safety", "OBS_VALUE" = "OBS_VALUE_SAFETY", "MEASURE" = "MEASURE_SAFETY"))

social_support_or_lack_v2 <- rename(social_support_or_lack_v1, replace = c(`Reference area` = "Reference_area", `Unit of measure` = "Unit_of_measure_social", "Measure" = "Measure_social", "OBS_VALUE" = "OBS_VALUE_SOCIAL", "MEASURE" = "MEASURE_SOCIAL"))

lonely_v2 <- rename(lonely_v1, replace = c(`Reference area` = "Reference_area", "OBS_VALUE" = "OBS_VALUE_LONELY", "Measure" = "Measure_lonely", "MEASURE" = "MEASURE_LONELY"))

satisfaction_life_rels_v2 <- rename(satisfaction_life_rels_v1, replace = c(`Reference area` = "Reference_area", "OBS_VALUE" = "OBS_VALUE_SAT", "Measure" = "Measure_sat", "MEASURE" = "MEASURE_SAT"))

```

```{r}
satisfaction_life_rels_v2[1:2,]
```


drop columns not anticipated to be needed or subset only the columns I will require


```{r}
# Columns to keep

phs_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_PHS", "Health_status", "HEALTH_STATUS")
el_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_EL", "Edu_attainment_lvl", "ATTAINMENT_LEV")
safety_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_SAFETY", "Measure_safety", "MEASURE_SAFETY")
social_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_SOCIAL", "Measure_social", "MEASURE_SOCIAL")
lonely_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_LONELY", "Measure_lonely", "MEASURE_LONELY")
satisfaction_columns_to_keep <- c("REF_AREA", "Reference_area", "TIME_PERIOD", "OBS_VALUE_SAT", "Measure_sat", "MEASURE_SAT")

phs_v3 <- subset(phs_v2, select = phs_columns_to_keep)
el_v3 <- subset(el_v2, select = el_columns_to_keep)
safety_v3 <- subset(safety_v2, select = safety_columns_to_keep)
social_support_or_lack_v3 <- subset(social_support_or_lack_v2, select = social_columns_to_keep)
lonely_v3 <- subset(lonely_v2, select = lonely_columns_to_keep)
satisfaction_life_rels_v3 <- subset(satisfaction_life_rels_v2, select = satisfaction_columns_to_keep)

```


```{r}

phs_v3

el_v3

safety_v3

social_support_or_lack_v3

lonely_v3

satisfaction_life_rels_v3

```

Pivot all data sets prior to merging them.

```{r}

phs_pivoted <- phs_v3 %>% 
  pivot_wider(
    id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"), 
    names_from = c("HEALTH_STATUS"),
    names_prefix = "health_status_",
    values_from = "OBS_VALUE_PHS"
    )

phs_pivoted

```


```{r}

el_pivoted <- el_v3 %>% 
  pivot_wider(
    id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"), 
    names_from = c("Edu_attainment_lvl"),
    values_from = "OBS_VALUE_EL"
    )


el_pivoted

```




```{r}

safety_pivoted <- safety_v3 %>% 
  pivot_wider(
    id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"), 
    names_from = c("Measure_safety"),
    values_from = "OBS_VALUE_SAFETY"
    )


safety_pivoted

```

```{r}

social_support_or_lack_pivoted <- social_support_or_lack_v3 %>% 
  pivot_wider(
    id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"), 
    names_from = c("Measure_social"),
    values_from = "OBS_VALUE_SOCIAL"
    )

social_support_or_lack_pivoted

```


```{r}

lonely_pivoted <- lonely_v3 %>% 
  pivot_wider(
    id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"), 
    names_from = c("Measure_lonely"),
    values_from = "OBS_VALUE_LONELY"
    )

lonely_pivoted

```


```{r}

satisfaction_life_rels_pivoted <- satisfaction_life_rels_v3 %>% 
  pivot_wider(
    id_cols = c("REF_AREA", "Reference_area", "TIME_PERIOD"), 
    names_from = c("Measure_sat"),
    values_from = "OBS_VALUE_SAT"
    )

satisfaction_life_rels_pivoted

```



Merge all 6 data sets


```{r}

phs_el <- merge(
  phs_pivoted,
  el_pivoted,
  by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
  all = TRUE
)

phs_el

```

```{r}

phs_el_safety <- merge(
  phs_el,
  safety_pivoted,
  by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
  all = TRUE
)

phs_el_safety

```


```{r}

phs_el_safety_social <- merge(
  phs_el_safety,
  social_support_or_lack_pivoted,
  by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
  all = TRUE
)

phs_el_safety_social

```

```{r}

phs_el_safety_social_lonely <- merge(
  phs_el_safety_social,
  lonely_pivoted,
  by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
  all = TRUE
)

phs_el_safety_social_lonely

```

```{r}

phs_el_safety_social_lonely_satisfaction <- merge(
  phs_el_safety_social_lonely,
  satisfaction_life_rels_pivoted,
  by = c("REF_AREA", "Reference_area", "TIME_PERIOD"),
  all = TRUE
)

phs_el_safety_social_lonely_satisfaction

```

Use fill from tidyr to fill, up and down, the lonely data

```{r}

phs_el_safety_social_lonely_satisfaction_fill <- phs_el_safety_social_lonely_satisfaction |>
  group_by(Reference_area) |>
  fill(`Feeling lonely`, .direction = "updown") |>
  ungroup()

phs_el_safety_social_lonely_satisfaction_fill

```

```{r}

phs_el_safety_social_lonely_satisfaction_filled <- phs_el_safety_social_lonely_satisfaction_fill |>
  group_by(Reference_area) |>
  fill(`Satisfaction with personal relationships`, .direction = "updown") |>
  ungroup()

phs_el_safety_social_lonely_satisfaction_filled

```


```{r}

df <- phs_el_safety_social_lonely_satisfaction_filled |>
  group_by(Reference_area) |>
  fill(`Life satisfaction`, .direction = "updown") |>
  ungroup()

df

```

Final version of the dataset to be used in the shiny app.

```{r}

df %>% view()
```

```{r}
# write.csv(df, file = "../data/df.csv")
```

```{r}
hist(df$`Feeling lonely`, xlab="lonely")
```
```{r}

sort(unique(df$`Feeling lonely`))


class(df)
```

```{r}
max(df$`Feeling lonely`, na.rm = TRUE)
```
```{r}
df %>% view()

```


Social and Community connection pulled from the Wellbeing dataset
Reading in the third dataset, social connectedness from the wellbeing dataset.
View the data.
Inspect the UNIT_MEASURE column’s unique values.
Assigning filtered dataset to new variable.
How is satisfaction with personal relationships scored?
Inspect which Measures are listed with the 0 to 10 unit measure.
Limit the data to totals on sex, age, and education, in order to capture a population not a specific group within a population. As well, narrow down the date range and select only the Measures whose unit measure is 0-10.
Looking into the feeling lonely measure.
The final datasets from wellbeing data I have chosen are as follows: